1. Bug fix. 2. Add fast long retention implementation #25
veya2ztn wants to merge 2 commits into syncdoth:main from
Conversation
…2. Add a fixed-length sequence argument when the inputs are (additional_token, past_kv). 3. Add a fast retention implementation for when the sequence length >> D**2.
x = self.dropout_module(x)
return x

self.dropout_module = torch.nn.Dropout(dropout)
self.fc1 = nn.Linear(self.embed_dim, ffn_dim)
self.fc2 = nn.Linear(ffn_dim, self.embed_dim)
if subln:
I would like to keep the use_rms_norm. Also, I would prefer if-else instead of a ternary here. If you want a ternary, could you make it something like:
norm_class = RMSNorm if use_rms_norm else LayerNorm
self.ffn_layernorm = norm_class(ffn_dim, eps=layernorm_eps) if subln else None
The embed_dim should be replaced by ffn_dim, I think:
if subln:
    if use_rms_norm:
        self.ffn_layernorm = RMSNorm(self.embed_dim, eps=layernorm_eps)
    else:
        self.ffn_layernorm = LayerNorm(self.embed_dim, eps=layernorm_eps)
else:
    self.ffn_layernorm = None
to
if subln:
    if use_rms_norm:
        self.ffn_layernorm = RMSNorm(ffn_dim, eps=layernorm_eps)
    else:
        self.ffn_layernorm = LayerNorm(ffn_dim, eps=layernorm_eps)
else:
    self.ffn_layernorm = None
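As a rough sketch only, combining the norm_class suggestion above with the ffn_dim fix (assuming use_rms_norm stays as a constructor flag) would give something like:
norm_class = RMSNorm if use_rms_norm else LayerNorm
self.ffn_layernorm = norm_class(ffn_dim, eps=layernorm_eps) if subln else None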
# multi-head
q, k, v = split_heads((q, k, v), B, T, self.num_heads)
k *= self.scaling # for scaled dot product
what's the reasoning for this change?
retnet/modeling_retnet.py
Outdated
- "prev_key_value" # bsz * num_head * v_dim * qk_dim
- "scale" # (1 or bsz) * num_head * 1 * 1
decay_mask, # 1 * num_head * chunk_size * chunk_size
decay_mask, # 1 * num_head * chunk_size * chunk_size
let's keep the spaces consistent.
self.config = config
self.embed_dim = config.decoder_embed_dim
self.dropout_module = torch.nn.Dropout(config.dropout)
self.drop_path = DropPath(np.linspace(0, config.drop_path_rate, config.decoder_layers)[depth]) if config.drop_path_rate > 0 else None
I prefer the previous code. This one-liner is too long and breaks the 100-character limit.
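For example, a possible multi-line form that stays under the limit (a sketch only, using the same config fields as the diff above; drop_path_rates is a local helper name introduced for illustration):
if config.drop_path_rate > 0:
    # rate schedule across decoder layers, indexed by this layer's depth
    drop_path_rates = np.linspace(0, config.drop_path_rate, config.decoder_layers)
    self.drop_path = DropPath(drop_path_rates[depth])
else:
    self.drop_path = None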
ways within their own init.
"""
pass
#pass
hidden_states = F.pad(hidden_states, (0, 0, 0, padding_len))
else:
slen = seq_length
if fixed_seq_len:slen=fixed_seq_len
forward_impl=forward_impl,
recurrent_chunk_size=recurrent_chunk_size,
retention_mask=retention_mask,
get_decay_scale=not self.training)
Why do we want decay scale during training?
Below is an example with one parallel pass and one recurrent pass:

with torch.inference_mode():  # <-- almost equal to `torch.no_grad()`
    model.eval()  # <-- disables dropout, batchnorm, and other layers that behave differently during inference
    out = model(old_inputs,
                forward_impl='parallel',  # <-- this line selects parallel mode
                use_cached=True,  # <-- must have use_cached=True
                **args)
    past_kv = out.past_key_values

model.train()  # if you want to train on later tokens, reactivate training mode here
out = model(new_inputs,
            forward_impl='recurrent',  # <-- this line selects recurrent mode
            use_cached=True,  # <-- must have use_cached=True for further recurrent mode
            past_key_values=past_kv,  # <-- this line is required
            **args)

If we don't call model.eval(), the recurrent mode fails to generate because it receives scale=None. However, model.eval() also changes the behavior of some layers. There is no other important reason here; it is just for my convenience. Basically, the goal is to generate a cache first and reuse it many times:
cache --> task_1
cache --> task_2
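A minimal sketch of that reuse pattern, keeping the same (assumed) argument names as the example above; prompt_inputs, task_1_inputs, and task_2_inputs are hypothetical placeholders:
with torch.inference_mode():
    model.eval()
    shared = model(prompt_inputs, forward_impl='parallel', use_cached=True)
    cache = shared.past_key_values

# the same cached prefix feeds two different continuations
out_task_1 = model(task_1_inputs, forward_impl='recurrent', use_cached=True, past_key_values=cache)
out_task_2 = model(task_2_inputs, forward_impl='recurrent', use_cached=True, past_key_values=cache)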
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
) No newline at end of file
The file should ideally end with a trailing newline (a PEP style convention).
Other than some formatting and refactoring issues, I love the fast-retention implementation! I was hoping to get into that. Thanks for your work!

Will this be merged?

There are some code styling issues and some things I don't understand fully. I think it's great to have its own branch for now.
Cached the fixed retnet_rel_pos (so it does not need to be generated at runtime).
Add fast retention implementation when the sequence length >> D**2. See https://github.com/veya2ztn/fast_retention
5.1 I set use_glu default to false, for consistency with the old code.
5.2 The layer norm setting in the FFN seems wrong; the self.embed_dim should be ffn_dim. Anyway, I roll back to
self.ffn_layernorm = LayerNorm(ffn_dim, eps=layernorm_eps) if subln else None